{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Formatting and deduping data\n", "\n", "Formatting columns and removing duplicates is an important part of data preparation.\n", "\n", "Preparing data for analysis is a crucial step in any data science project. One aspect of data preparation is formatting columns and removing duplicates. Inaccurate or inconsistent formatting of columns can make it difficult to analyze data or even result in incorrect results. Similarly, duplicate data can skew analysis and lead to inaccurate conclusions. \n", "\n", "This notebook will explore how to format columns in Pandas dataframes to ensure data accuracy and consistency. We will also discuss detecting and removing duplicate data and handling missing values in columns. These techniques ensure data is adequately prepared for analysis and modelling, leading to more accurate and reliable results." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How To" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv(\"data/housing.csv\", dtype={\"housing_median_age\": int,\"ocean_proximity\": \"category\"})" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "longitude float64\n", "latitude float64\n", "housing_median_age int32\n", "total_rooms float64\n", "total_bedrooms float64\n", "population float64\n", "households float64\n", "median_income float64\n", "median_house_value float64\n", "ocean_proximity category\n", "dtype: object" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dtypes" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
longitudelatitudehousing_median_agetotal_roomstotal_bedroomspopulationhouseholdsmedian_incomemedian_house_valueocean_proximity
0-122.2337.8841880.0129.0322.0126.08.3252452600.0NEAR BAY
1-122.2237.86217099.01106.02401.01138.08.3014358500.0NEAR BAY
2-122.2437.85521467.0190.0496.0177.07.2574352100.0NEAR BAY
3-122.2537.85521274.0235.0558.0219.05.6431341300.0NEAR BAY
4-122.2537.85521627.0280.0565.0259.03.8462342200.0NEAR BAY
\n", "
" ], "text/plain": [ " longitude latitude housing_median_age total_rooms total_bedrooms \\\n", "0 -122.23 37.88 41 880.0 129.0 \n", "1 -122.22 37.86 21 7099.0 1106.0 \n", "2 -122.24 37.85 52 1467.0 190.0 \n", "3 -122.25 37.85 52 1274.0 235.0 \n", "4 -122.25 37.85 52 1627.0 280.0 \n", "\n", " population households median_income median_house_value ocean_proximity \n", "0 322.0 126.0 8.3252 452600.0 NEAR BAY \n", "1 2401.0 1138.0 8.3014 358500.0 NEAR BAY \n", "2 496.0 177.0 7.2574 352100.0 NEAR BAY \n", "3 558.0 219.0 5.6431 341300.0 NEAR BAY \n", "4 565.0 259.0 3.8462 342200.0 NEAR BAY " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "int_cols = [\"total_rooms\", \"population\", \"households\", \"median_house_value\"]\n", "df[int_cols] = df[int_cols].astype(int)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "longitude float64\n", "latitude float64\n", "housing_median_age int32\n", "total_rooms int32\n", "total_bedrooms float64\n", "population int32\n", "households int32\n", "median_income float64\n", "median_house_value int32\n", "ocean_proximity category\n", "dtype: object" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## De-duplicating data" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 False\n", "3 False\n", "4 False\n", " ... \n", "20635 False\n", "20636 False\n", "20637 False\n", "20638 False\n", "20639 False\n", "Length: 20640, dtype: bool" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.duplicated()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "0 True\n", "1 True\n", "2 True\n", "3 True\n", "4 True\n", " ... \n", "20635 True\n", "20636 True\n", "20637 True\n", "20638 True\n", "20639 False\n", "Name: ocean_proximity, Length: 20640, dtype: bool" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.ocean_proximity.duplicated(\"last\")" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 False\n", "3 False\n", "4 False\n", " ... \n", "3041 True\n", "19520 True\n", "22 True\n", "1215 True\n", "8840 True\n", "Length: 20645, dtype: bool" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.append(df.sample(5)).duplicated()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "df_dup = df.append(df.sample(5))" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 False\n", "3 False\n", "4 False\n", " ... \n", "14693 True\n", "14572 True\n", "7974 True\n", "4140 True\n", "20393 True\n", "Length: 20645, dtype: bool" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_dup.duplicated()" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
longitudelatitudehousing_median_agetotal_roomstotal_bedroomspopulationhouseholdsmedian_incomemedian_house_valueocean_proximity
0-122.2337.8841880129.03221268.3252452600NEAR BAY
1-122.2237.862170991106.0240111388.3014358500NEAR BAY
2-122.2437.85521467190.04961777.2574352100NEAR BAY
3-122.2537.85521274235.05582195.6431341300NEAR BAY
4-122.2537.85521627280.05652593.8462342200NEAR BAY
.................................
20635-121.0939.48251665374.08453301.560378100INLAND
20636-121.2139.4918697150.03561142.556877100INLAND
20637-121.2239.43172254485.010074331.700092300INLAND
20638-121.3239.43181860409.07413491.867284700INLAND
20639-121.2439.37162785616.013875302.388689400INLAND
\n", "

20640 rows × 10 columns

\n", "
" ], "text/plain": [ " longitude latitude housing_median_age total_rooms total_bedrooms \\\n", "0 -122.23 37.88 41 880 129.0 \n", "1 -122.22 37.86 21 7099 1106.0 \n", "2 -122.24 37.85 52 1467 190.0 \n", "3 -122.25 37.85 52 1274 235.0 \n", "4 -122.25 37.85 52 1627 280.0 \n", "... ... ... ... ... ... \n", "20635 -121.09 39.48 25 1665 374.0 \n", "20636 -121.21 39.49 18 697 150.0 \n", "20637 -121.22 39.43 17 2254 485.0 \n", "20638 -121.32 39.43 18 1860 409.0 \n", "20639 -121.24 39.37 16 2785 616.0 \n", "\n", " population households median_income median_house_value \\\n", "0 322 126 8.3252 452600 \n", "1 2401 1138 8.3014 358500 \n", "2 496 177 7.2574 352100 \n", "3 558 219 5.6431 341300 \n", "4 565 259 3.8462 342200 \n", "... ... ... ... ... \n", "20635 845 330 1.5603 78100 \n", "20636 356 114 2.5568 77100 \n", "20637 1007 433 1.7000 92300 \n", "20638 741 349 1.8672 84700 \n", "20639 1387 530 2.3886 89400 \n", "\n", " ocean_proximity \n", "0 NEAR BAY \n", "1 NEAR BAY \n", "2 NEAR BAY \n", "3 NEAR BAY \n", "4 NEAR BAY \n", "... ... \n", "20635 INLAND \n", "20636 INLAND \n", "20637 INLAND \n", "20638 INLAND \n", "20639 INLAND \n", "\n", "[20640 rows x 10 columns]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_dup[~df_dup.duplicated()]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise\n", "Try generating unique values in the median age from the dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[...]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Additional Resources" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- [Pandas AsType](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html)\n", "- [Pandas Duplicated Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 4 }